Data Organization — Storage Systems and Data Management Guide 1.0 documentation
Data Organization
Understanding Data Organization
Data organization refers to the systematic arrangement and structuring of data to optimize storage efficiency, access speed, and management ease. Proper data organization is crucial for system performance and data integrity.
Data Organization Fundamentals
What is Data Organization?
Data organization encompasses:
Data Structure: How data is arranged and stored
Data Classification: Categorizing data by type, importance, and access patterns
Data Hierarchy: Organizing data in logical levels and relationships
Data Access Patterns: Understanding how data is accessed and used
Data Lifecycle: Managing data from creation to deletion
Data Classification Types
By Access Frequency
Hot Data: Frequently accessed, requires fast storage (SSD)
Warm Data: Occasionally accessed, standard performance storage
Cold Data: Rarely accessed, slower but cost-effective storage
Archival Data: Long-term retention, very infrequent access
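These tiers can be approximated on a live system from file ages. Below is a minimal bash sketch, assuming illustrative 7- and 90-day thresholds; note that most filesystems mount with relatime, so atime is only a rough signal:
#!/bin/bash
# classify_by_access.sh - rough hot/warm/cold bucketing by last-access time (thresholds are illustrative)
DIR="${1:-$HOME/Documents}"
find "$DIR" -type f -atime -7 > /tmp/hot_files.txt # hot: accessed in the last 7 days
find "$DIR" -type f -atime +7 -atime -90 > /tmp/warm_files.txt # warm: between 7 and 90 days
find "$DIR" -type f -atime +90 > /tmp/cold_files.txt # cold: untouched for 90+ days, candidates for cheaper storage
wc -l /tmp/hot_files.txt /tmp/warm_files.txt /tmp/cold_files.txt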
By Data Type
Structured Data: Databases, spreadsheets, organized formats
Semi-structured Data: JSON, XML, log files
Unstructured Data: Documents, images, videos, emails
By Criticality
Critical Data: Business-essential, requires high availability
Important Data: Significant but not critical
Standard Data: Regular business data
Temporary Data: Short-term, disposable data
Ubuntu 22.04 Data Organization Tools
Directory Structure Best Practices
# Standard Ubuntu directory structure
/
├── home/ # User data
├── var/ # Variable data (logs, databases)
├── usr/ # User programs and libraries
├── opt/ # Optional software packages
├── srv/ # Service data
├── tmp/ # Temporary files
└── mnt/ # Mount points for external storage
# Recommended user data organization
/home/username/
├── Documents/ # Personal documents
├── Projects/ # Development projects
├── Media/ # Photos, videos, music
├── Archive/ # Old/archived files
├── Backup/ # Local backups
└── Work/ # Work-related files
File Organization Commands
# Create organized directory structure
mkdir -p ~/Documents/{Personal,Work}/{2024,2023,Archive}
mkdir -p ~/Projects/{Active,Completed,Archive}
mkdir -p ~/Media/{Photos,Videos,Music}/{2024,2023,Archive}
# File organization by date
mkdir -p ~/Archive/2024
find ~/Downloads -type f -newermt "2024-01-01" ! -newermt "2025-01-01" \
-exec mv {} ~/Archive/2024/ \;
# Organize files by extension
cd ~/Downloads
for ext in pdf doc docx txt; do
mkdir -p ~/Documents/$(echo $ext | tr '[:lower:]' '[:upper:]')
find . -name "*.$ext" -exec mv {} ~/Documents/$(echo $ext | tr '[:lower:]' '[:upper:]')/ \;
done
# Find and organize duplicate files
fdupes -r ~/Documents -d
Data Backup and Archival
Backup Strategies
# 3-2-1 Backup Rule Implementation
# 3 copies, 2 different media types, 1 offsite
# Local backup (rsync)
rsync -avH --delete ~/Documents/ /backup/documents/
# External drive backup
rsync -avH --delete ~/Documents/ /media/external/backup/documents/
# Cloud backup (rclone)
rclone sync ~/Documents/ cloud:backup/documents/
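# Note: "cloud" is a placeholder rclone remote name; define remotes first with: rclone config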
# Compressed archive backup
tar -czf /backup/documents_$(date +%Y%m%d).tar.gz ~/Documents/
Automated Data Organization
# Create organization script
cat > ~/bin/organize_data.sh << 'EOF'
#!/bin/bash
# Organize downloads by file type
organize_downloads() {
cd ~/Downloads
# Create directories
mkdir -p {Documents,Images,Videos,Music,Archives,Software}
# Move files by type
find . -maxdepth 1 -type f \( -iname "*.pdf" -o -iname "*.doc" -o -iname "*.docx" -o -iname "*.txt" \) -exec mv {} Documents/ \;
find . -maxdepth 1 -type f \( -iname "*.jpg" -o -iname "*.png" -o -iname "*.gif" -o -iname "*.jpeg" \) -exec mv {} Images/ \;
find . -maxdepth 1 -type f \( -iname "*.mp4" -o -iname "*.avi" -o -iname "*.mkv" \) -exec mv {} Videos/ \;
find . -maxdepth 1 -type f \( -iname "*.mp3" -o -iname "*.flac" -o -iname "*.wav" \) -exec mv {} Music/ \;
find . -maxdepth 1 -type f \( -iname "*.zip" -o -iname "*.tar.gz" -o -iname "*.rar" \) -exec mv {} Archives/ \;
find . -maxdepth 1 -type f \( -iname "*.deb" -o -iname "*.appimage" \) -exec mv {} Software/ \;
}
# Clean old temporary files
clean_temp() {
find ~/Downloads -type f -mtime +30 -delete
find ~/.cache -type f -mtime +7 -delete
find /tmp -user $USER -type f -mtime +1 -delete 2>/dev/null
}
# Archive old files
archive_old() {
mkdir -p ~/Archive/$(date +%Y)
find ~/Documents -type f -mtime +365 -exec mv {} ~/Archive/$(date +%Y)/ \;
}
# Execute functions
organize_downloads
clean_temp
archive_old
echo "Data organization completed: $(date)"
EOF
chmod +x ~/bin/organize_data.sh
# Schedule with cron (append to any existing crontab instead of overwriting it)
(crontab -l 2>/dev/null; echo "0 2 * * * $HOME/bin/organize_data.sh") | crontab -
Data Compression and Deduplication
File Compression Techniques
# Install compression tools
sudo apt install p7zip-full rar unrar gzip bzip2 xz-utils # rar and unrar come from the multiverse repository
# Create compressed archives
# gzip (fast, good compression)
tar -czf archive.tar.gz /path/to/data
# bzip2 (better compression, slower)
tar -cjf archive.tar.bz2 /path/to/data
# xz (best compression, slowest)
tar -cJf archive.tar.xz /path/to/data
# 7zip (excellent compression)
7z a -t7z -mx=9 archive.7z /path/to/data
# Compare compression ratios
for method in gz bz2 xz 7z; do
echo "Compressing with $method..."
case $method in
gz) tar -czf test.tar.gz ~/Documents ;;
bz2) tar -cjf test.tar.bz2 ~/Documents ;;
xz) tar -cJf test.tar.xz ~/Documents ;;
7z) 7z a test.7z ~/Documents ;;
esac
echo "$method: $(ls -lh test.*$method* | awk '{print $5}')"
done
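# Remove the test archives when done comparing
rm -f test.tar.gz test.tar.bz2 test.tar.xz test.7z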
Duplicate File Management
# Install deduplication tools
sudo apt install fdupes rdfind
# Find duplicates with fdupes
fdupes -r ~/Documents
# Find and delete duplicates interactively
fdupes -r -d ~/Documents
# Find duplicates with rdfind
rdfind ~/Documents
# Replace duplicates with hard links (suppress the results file with -makeresultsfile false)
rdfind -makeresultsfile false -makehardlinks true ~/Documents
Data Monitoring and Analysis
Storage Usage Analysis
# Analyze disk usage by directory
ncdu /home
# Find largest files
find /home -type f -exec du -h {} + | sort -rh | head -20
# Analyze file types and sizes
find /home -type f -name '*.*' | sed 's/.*\.//' | sort | uniq -c | sort -rn
# Monitor storage usage trends
cat > ~/bin/storage_trend.sh << 'EOF'
#!/bin/bash
LOGFILE="$HOME/.storage_usage.log" # /var/log is root-only; keep the log user-writable
echo "$(date): $(df -h / | tail -1 | awk '{print $5}')" >> "$LOGFILE"
# Generate weekly report
if [ "$(date +%u)" -eq 1 ]; then
echo "Weekly Storage Report - $(date)" > /tmp/storage_report.txt
tail -7 $LOGFILE >> /tmp/storage_report.txt
mail -s "Storage Usage Report" admin@example.com < /tmp/storage_report.txt
fi
EOF
chmod +x ~/bin/storage_trend.sh
echo "0 0 * * * ~/bin/storage_trend.sh" | crontab -
File System Metadata Analysis
# Analyze inode usage
df -i
# Find directories with most files
find /home -type d -exec sh -c 'echo "$(find "$1" -maxdepth 1 | wc -l) $1"' _ {} \; | sort -rn | head -10
# Analyze file access patterns
find /home -type f -atime -7 | wc -l # Files accessed in last 7 days
find /home -type f -mtime -7 | wc -l # Files modified in last 7 days
find /home -type f -atime +365 | wc -l # Files not accessed in over a year
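# Caveat: most filesystems are mounted with relatime, so atime updates are throttled and these counts are approximate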
Frequently Asked Questions
Q: How should I organize my home directory in Ubuntu 22.04?
A: Follow this recommended structure:
~/
├── Documents/
│ ├── Personal/
│ ├── Work/
│ └── Archive/
├── Projects/
│ ├── Active/
│ ├── Completed/
│ └── Learning/
├── Media/
│ ├── Photos/
│ ├── Videos/
│ └── Music/
├── Downloads/
│ └── (temporary, clean regularly)
├── Backup/
│ └── (local backup copies)
└── Scripts/
└── (personal automation scripts)
Q: What’s the best way to handle duplicate files?
A: Use these strategies:
# 1. Find duplicates without deleting
fdupes -r ~/Documents
# 2. Interactive deletion
fdupes -r -d ~/Documents
# 3. Automatic deletion (keep first occurrence)
fdupes -r -f ~/Documents | grep -v '^$' | xargs -d '\n' rm --
# 4. Convert to hard links (saves space)
rdfind -makehardlinks true ~/Documents
Q: How can I automate file organization?
A: Create automated organization scripts:
# File organizer by extension
#!/bin/bash
organize_by_extension() {
local source_dir="$1"
local target_dir="$2"
cd "$source_dir"
for file in *.*; do
if [ -f "$file" ]; then
ext="${file##*.}"
ext_upper=$(echo "$ext" | tr '[:lower:]' '[:upper:]')
mkdir -p "$target_dir/$ext_upper"
mv "$file" "$target_dir/$ext_upper/"
fi
done
}
# Usage
organize_by_extension ~/Downloads ~/Downloads/Organized
Q: How do I implement a data retention policy?
A: Create a retention management system:
#!/bin/bash
# data_retention.sh
# Define retention periods (days)
TEMP_RETENTION=7
LOG_RETENTION=90
BACKUP_RETENTION=365
ARCHIVE_RETENTION=2555 # 7 years
# Clean temporary files
find /tmp -user "$USER" -type f -mtime +$TEMP_RETENTION -delete 2>/dev/null
find ~/.cache -type f -mtime +$TEMP_RETENTION -delete
# Archive old logs
find /var/log -name "*.log" -mtime +$LOG_RETENTION -exec gzip {} \; # needs root for /var/log
# Remove old backups
find /backup -name "*.tar.gz" -mtime +$BACKUP_RETENTION -delete
# Alert for very old archives
find /archive -type f -mtime +$ARCHIVE_RETENTION -ls | \
mail -s "Files exceeding retention policy" admin@example.com
Coding Examples
Python Data Organization Manager
#!/usr/bin/env python3
"""
Advanced data organization manager for Ubuntu 22.04
"""
import os
import shutil
import hashlib
import json
import time
from pathlib import Path
from collections import defaultdict
import argparse
class DataOrganizer:
def __init__(self, base_path=None):
self.base_path = Path(base_path) if base_path else Path.home()
self.file_types = {
'documents': ['.pdf', '.doc', '.docx', '.txt', '.rtf', '.odt'],
'images': ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.svg'],
'videos': ['.mp4', '.avi', '.mkv', '.mov', '.wmv', '.flv', '.webm'],
'music': ['.mp3', '.flac', '.wav', '.ogg', '.m4a', '.aac'],
'archives': ['.zip', '.tar', '.gz', '.bz2', '.xz', '.rar', '.7z'],
'code': ['.py', '.js', '.html', '.css', '.java', '.cpp', '.c', '.php'],
'data': ['.csv', '.json', '.xml', '.sql', '.db', '.sqlite']
}
self.stats = defaultdict(int)
def get_file_hash(self, filepath):
"""Calculate MD5 hash of a file"""
hash_md5 = hashlib.md5()
try:
with open(filepath, "rb") as f:
for chunk in iter(lambda: f.read(4096), b""):
hash_md5.update(chunk)
return hash_md5.hexdigest()
except (IOError, OSError):
return None
def find_duplicates(self, directory):
"""Find duplicate files in a directory"""
duplicates = defaultdict(list)
for root, dirs, files in os.walk(directory):
for file in files:
filepath = Path(root) / file
if filepath.is_file():
file_hash = self.get_file_hash(filepath)
if file_hash:
duplicates[file_hash].append(filepath)
# Filter out unique files
return {k: v for k, v in duplicates.items() if len(v) > 1}
def organize_by_type(self, source_dir, target_dir=None):
"""Organize files by type"""
source_path = Path(source_dir)
target_path = Path(target_dir) if target_dir else source_path / "Organized"
if not source_path.exists():
raise ValueError(f"Source directory {source_path} does not exist")
target_path.mkdir(exist_ok=True)
for file_path in source_path.rglob('*'):
if file_path.is_file() and file_path.parent == source_path:
file_ext = file_path.suffix.lower()
# Determine file category
category = 'misc'
for cat, extensions in self.file_types.items():
if file_ext in extensions:
category = cat
break
# Create category directory
category_dir = target_path / category.title()
category_dir.mkdir(exist_ok=True)
# Move file
new_path = category_dir / file_path.name
# Handle naming conflicts
counter = 1
original_name = new_path.stem
while new_path.exists():
new_path = category_dir / f"{original_name}_{counter}{file_path.suffix}"
counter += 1
try:
shutil.move(str(file_path), str(new_path))
self.stats[f'moved_{category}'] += 1
print(f"Moved {file_path.name} to {category.title()}/")
except (IOError, OSError) as e:
print(f"Error moving {file_path}: {e}")
def organize_by_date(self, source_dir, target_dir=None):
"""Organize files by modification date"""
source_path = Path(source_dir)
target_path = Path(target_dir) if target_dir else source_path / "ByDate"
target_path.mkdir(exist_ok=True)
for file_path in source_path.rglob('*'):
if file_path.is_file():
# Get modification time
mtime = file_path.stat().st_mtime
date_str = time.strftime('%Y/%m', time.localtime(mtime))
# Create date directory
date_dir = target_path / date_str
date_dir.mkdir(parents=True, exist_ok=True)
# Move file
new_path = date_dir / file_path.name
# Handle naming conflicts
counter = 1
original_name = new_path.stem
while new_path.exists():
new_path = date_dir / f"{original_name}_{counter}{file_path.suffix}"
counter += 1
try:
shutil.move(str(file_path), str(new_path))
self.stats['moved_by_date'] += 1
except (IOError, OSError) as e:
print(f"Error moving {file_path}: {e}")
def clean_empty_directories(self, directory):
"""Remove empty directories"""
removed_count = 0
for root, dirs, files in os.walk(directory, topdown=False):
for dir_name in dirs:
dir_path = Path(root) / dir_name
try:
if not any(dir_path.iterdir()):
dir_path.rmdir()
removed_count += 1
print(f"Removed empty directory: {dir_path}")
except OSError:
pass # Directory not empty or permission denied
self.stats['removed_empty_dirs'] = removed_count
def analyze_directory(self, directory):
"""Analyze directory structure and file distribution"""
analysis = {
'total_files': 0,
'total_size': 0,
'file_types': defaultdict(int),
'size_by_type': defaultdict(int),
'largest_files': [],
'oldest_files': [],
'newest_files': []
}
files_with_info = []
for root, dirs, files in os.walk(directory):
for file in files:
file_path = Path(root) / file
if file_path.is_file():
try:
stat_info = file_path.stat()
file_size = stat_info.st_size
file_ext = file_path.suffix.lower()
analysis['total_files'] += 1
analysis['total_size'] += file_size
analysis['file_types'][file_ext] += 1
analysis['size_by_type'][file_ext] += file_size
files_with_info.append({
'path': file_path,
'size': file_size,
'mtime': stat_info.st_mtime
})
except OSError:
continue
# Sort files for top lists
files_with_info.sort(key=lambda x: x['size'], reverse=True)
analysis['largest_files'] = [
{'path': str(f['path']), 'size_mb': f['size'] / (1024*1024)}
for f in files_with_info[:10]
]
files_with_info.sort(key=lambda x: x['mtime'])
analysis['oldest_files'] = [
{'path': str(f['path']), 'date': time.strftime('%Y-%m-%d', time.localtime(f['mtime']))}
for f in files_with_info[:10]
]
files_with_info.sort(key=lambda x: x['mtime'], reverse=True)
analysis['newest_files'] = [
{'path': str(f['path']), 'date': time.strftime('%Y-%m-%d', time.localtime(f['mtime']))}
for f in files_with_info[:10]
]
return analysis
def create_backup_structure(self, source_dir, backup_dir):
"""Create organized backup structure"""
source_path = Path(source_dir)
backup_path = Path(backup_dir)
timestamp = time.strftime('%Y%m%d_%H%M%S')
backup_target = backup_path / f"backup_{timestamp}"
backup_target.mkdir(parents=True, exist_ok=True)
# Copy files with organization
for category, extensions in self.file_types.items():
category_backup = backup_target / category
category_backup.mkdir(exist_ok=True)
for root, dirs, files in os.walk(source_path):
for file in files:
file_path = Path(root) / file
if file_path.suffix.lower() in extensions:
try:
shutil.copy2(file_path, category_backup)
self.stats[f'backed_up_{category}'] += 1
except (IOError, OSError) as e:
print(f"Error backing up {file_path}: {e}")
def generate_report(self):
"""Generate organization report"""
report = {
'timestamp': time.strftime('%Y-%m-%d %H:%M:%S'),
'statistics': dict(self.stats),
'summary': f"Processed {sum(self.stats.values())} operations"
}
return json.dumps(report, indent=2)
def main():
parser = argparse.ArgumentParser(description='Data Organization Manager')
parser.add_argument('command', choices=['organize', 'analyze', 'duplicates', 'backup', 'clean'])
parser.add_argument('source', help='Source directory')
parser.add_argument('--target', help='Target directory')
parser.add_argument('--by-type', action='store_true', help='Organize by file type')
parser.add_argument('--by-date', action='store_true', help='Organize by date')
args = parser.parse_args()
organizer = DataOrganizer()
if args.command == 'organize':
if args.by_type:
organizer.organize_by_type(args.source, args.target)
elif args.by_date:
organizer.organize_by_date(args.source, args.target)
else:
organizer.organize_by_type(args.source, args.target)
elif args.command == 'analyze':
analysis = organizer.analyze_directory(args.source)
print(json.dumps(analysis, indent=2, default=str))
elif args.command == 'duplicates':
duplicates = organizer.find_duplicates(args.source)
print(f"Found {len(duplicates)} sets of duplicate files:")
for hash_val, files in duplicates.items():
print(f"Hash {hash_val[:8]}...")
for file in files:
print(f" {file}")
elif args.command == 'backup':
if not args.target:
print("Target directory required for backup")
return
organizer.create_backup_structure(args.source, args.target)
elif args.command == 'clean':
organizer.clean_empty_directories(args.source)
print(organizer.generate_report())
if __name__ == "__main__":
main()
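Assuming the script above is saved as data_organizer.py (any filename works), typical invocations would look like:
python3 data_organizer.py organize ~/Downloads --by-type
python3 data_organizer.py duplicates ~/Documents
python3 data_organizer.py analyze ~/Projects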
Bash File Organization System
#!/bin/bash
# advanced_file_organizer.sh - Advanced file organization system for Ubuntu 22.04
# Configuration
CONFIG_FILE="$HOME/.file_organizer.conf"
LOG_FILE="$HOME/.file_organizer.log"
# Default configuration (written only if no config exists yet, so edits survive re-runs)
if [[ ! -f "$CONFIG_FILE" ]]; then
cat > "$CONFIG_FILE" << 'EOF'
# File Organization Configuration
ORGANIZE_DOWNLOADS=true
ORGANIZE_DESKTOP=true
AUTO_ARCHIVE_DAYS=365
DUPLICATE_ACTION=ask # ask, delete, link
BACKUP_ENABLED=true
BACKUP_DIR="$HOME/Backup"
# File type definitions
DOCUMENTS="pdf doc docx txt rtf odt"
IMAGES="jpg jpeg png gif bmp tiff svg"
VIDEOS="mp4 avi mkv mov wmv flv webm"
MUSIC="mp3 flac wav ogg m4a aac"
ARCHIVES="zip tar gz bz2 xz rar 7z"
CODE="py js html css java cpp c php"
DATA="csv json xml sql db sqlite"
EOF
fi
# Load configuration
source "$CONFIG_FILE"
# Logging function
log_message() {
echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
echo "$1"
}
# Create directory structure
create_structure() {
local base_dir="$1"
log_message "Creating directory structure in $base_dir"
mkdir -p "$base_dir"/{Documents,Images,Videos,Music,Archives,Code,Data,Misc}
mkdir -p "$base_dir"/Archive/{$(date +%Y),$(date -d 'last year' +%Y)}
mkdir -p "$base_dir"/Backup
log_message "Directory structure created"
}
# Get file category
get_file_category() {
local file="$1"
local extension="${file##*.}"
extension=$(echo "$extension" | tr '[:upper:]' '[:lower:]')
case " $DOCUMENTS " in *" $extension "*) echo "Documents"; return ;; esac
case " $IMAGES " in *" $extension "*) echo "Images"; return ;; esac
case " $VIDEOS " in *" $extension "*) echo "Videos"; return ;; esac
case " $MUSIC " in *" $extension "*) echo "Music"; return ;; esac
case " $ARCHIVES " in *" $extension "*) echo "Archives"; return ;; esac
case " $CODE " in *" $extension "*) echo "Code"; return ;; esac
case " $DATA " in *" $extension "*) echo "Data"; return ;; esac
echo "Misc"
}
# Organize files by type
organize_by_type() {
local source_dir="$1"
local target_dir="$2"
log_message "Organizing files from $source_dir to $target_dir"
create_structure "$target_dir"
local moved_count=0
find "$source_dir" -maxdepth 1 -type f | while read -r file; do
if [[ -f "$file" ]]; then
local filename=$(basename "$file")
local category=$(get_file_category "$filename")
local target_file="$target_dir/$category/$filename"
# Handle name conflicts
local counter=1
local base_name="${filename%.*}"
local extension="${filename##*.}"
while [[ -e "$target_file" ]]; do
if [[ "$extension" != "$filename" ]]; then
target_file="$target_dir/$category/${base_name}_${counter}.${extension}"
else
target_file="$target_dir/$category/${filename}_${counter}"
fi
((counter++))
done
if mv "$file" "$target_file"; then
log_message "Moved $filename to $category/"
((moved_count++))
else
log_message "Failed to move $filename"
fi
fi
done < <(find "$source_dir" -maxdepth 1 -type f)
log_message "Organized $moved_count files"
}
# Find and handle duplicates
handle_duplicates() {
local directory="$1"
local action="${2:-$DUPLICATE_ACTION}"
log_message "Finding duplicates in $directory"
if ! command -v fdupes &> /dev/null; then
log_message "Installing fdupes for duplicate detection"
sudo apt update && sudo apt install -y fdupes
fi
local duplicates_file="/tmp/duplicates_$$"
fdupes -r "$directory" > "$duplicates_file"
local duplicate_sets=0
local files_processed=0
local set_first=""
while IFS= read -r line; do
    if [[ -z "$line" ]]; then
        # fdupes separates duplicate sets with blank lines
        ((duplicate_sets++))
        set_first=""
    elif [[ -f "$line" ]]; then
        if [[ -z "$set_first" ]]; then
            # Always keep the first file of each set
            set_first="$line"
        else
            case "$action" in
                "delete")
                    rm "$line"
                    log_message "Deleted duplicate: $line"
                    ;;
                "link")
                    rm "$line"
                    ln "$set_first" "$line"
                    log_message "Linked duplicate: $line"
                    ;;
                "ask")
                    echo "Duplicate of $set_first: $line"
                    # Read the answer from the terminal, not from the redirected duplicates file
                    read -p "Delete this file? (y/N): " -n 1 -r < /dev/tty
                    echo
                    if [[ $REPLY =~ ^[Yy]$ ]]; then
                        rm "$line"
                        log_message "User deleted duplicate: $line"
                    fi
                    ;;
            esac
        fi
        ((files_processed++))
    fi
done < "$duplicates_file"
rm "$duplicates_file"
log_message "Processed $duplicate_sets sets of duplicates"
}
# Archive old files
archive_old_files() {
local source_dir="$1"
local days="${2:-$AUTO_ARCHIVE_DAYS}"
log_message "Archiving files older than $days days from $source_dir"
local archive_dir="$source_dir/Archive/$(date +%Y)"
mkdir -p "$archive_dir"
local archived_count=0
find "$source_dir" -type f -mtime +$days ! -path "*/Archive/*" ! -path "*/Backup/*" | while read -r file; do
local relative_path="${file#$source_dir/}"
local archive_path="$archive_dir/$relative_path"
local archive_parent=$(dirname "$archive_path")
mkdir -p "$archive_parent"
if mv "$file" "$archive_path"; then
log_message "Archived: $relative_path"
((archived_count++))
fi
done < <(find "$source_dir" -type f -mtime +"$days" ! -path "*/Archive/*" ! -path "*/Backup/*")
log_message "Archived $archived_count files"
}
# Create backup
create_backup() {
local source_dir="$1"
local backup_dir="${2:-$BACKUP_DIR}"
if [[ "$BACKUP_ENABLED" != "true" ]]; then
log_message "Backup disabled in configuration"
return
fi
log_message "Creating backup of $source_dir to $backup_dir"
local timestamp=$(date +%Y%m%d_%H%M%S)
local backup_target="$backup_dir/backup_$timestamp"
mkdir -p "$backup_target"
if rsync -av --progress "$source_dir/" "$backup_target/"; then
log_message "Backup created successfully: $backup_target"
# Compress backup
if command -v tar &> /dev/null; then
tar -czf "$backup_target.tar.gz" -C "$backup_dir" "backup_$timestamp"
rm -rf "$backup_target"
log_message "Backup compressed: $backup_target.tar.gz"
fi
else
log_message "Backup failed"
fi
}
# Generate organization report
generate_report() {
local directory="$1"
local report_file="/tmp/organization_report_$(date +%Y%m%d).txt"
{
echo "File Organization Report"
echo "========================"
echo "Generated: $(date)"
echo "Directory: $directory"
echo ""
echo "Directory Structure:"
tree -d -L 2 "$directory" 2>/dev/null || find "$directory" -type d | head -20
echo ""
echo "File Count by Type:"
for type in Documents Images Videos Music Archives Code Data Misc; do
if [[ -d "$directory/$type" ]]; then
count=$(find "$directory/$type" -type f | wc -l)
echo "$type: $count files"
fi
done
echo ""
echo "Storage Usage:"
du -sh "$directory"/* 2>/dev/null | sort -hr
echo ""
echo "Recent Activity (from log):"
tail -20 "$LOG_FILE"
} > "$report_file"
echo "Report generated: $report_file"
# Email report if mail is configured
if command -v mail &> /dev/null; then
cat "$report_file" | mail -s "File Organization Report" "$USER@localhost"
fi
}
# Main execution
main() {
case "${1:-help}" in
"organize")
organize_by_type "${2:-$HOME/Downloads}" "${3:-$HOME/Organized}"
;;
"duplicates")
handle_duplicates "${2:-$HOME}" "${3:-ask}"
;;
"archive")
archive_old_files "${2:-$HOME}" "${3:-365}"
;;
"backup")
create_backup "${2:-$HOME}" "${3:-$BACKUP_DIR}"
;;
"report")
generate_report "${2:-$HOME}"
;;
"full")
# Full organization workflow
local target_dir="${2:-$HOME/Organized}"
log_message "Starting full organization workflow"
create_backup "$HOME/Downloads"
organize_by_type "$HOME/Downloads" "$target_dir"
handle_duplicates "$target_dir" "ask"
archive_old_files "$target_dir"
generate_report "$target_dir"
log_message "Full organization workflow completed"
;;
"config")
nano "$CONFIG_FILE"
;;
"help"|*)
echo "Advanced File Organizer for Ubuntu 22.04"
echo "Usage: $0 [command] [options]"
echo ""
echo "Commands:"
echo " organize [source] [target] - Organize files by type"
echo " duplicates [dir] [action] - Handle duplicate files (ask/delete/link)"
echo " archive [dir] [days] - Archive old files"
echo " backup [source] [target] - Create backup"
echo " report [directory] - Generate organization report"
echo " full [target] - Run complete organization workflow"
echo " config - Edit configuration"
echo ""
echo "Examples:"
echo " $0 organize ~/Downloads ~/Organized"
echo " $0 duplicates ~/Documents delete"
echo " $0 archive ~/Documents 180"
echo " $0 full ~/MyFiles"
;;
esac
}
# Run main function with all arguments
main "$@"
Best Practices for Data Organization
Strategic Planning
Data Classification Strategy:
Identify data types and access patterns
Define retention policies
Establish backup requirements
Plan for data growth
Directory Structure Design:
Use consistent naming conventions
Implement logical hierarchies
Plan for scalability
Document organization rules
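For example, one possible (purely illustrative) convention is an ISO-8601 date prefix plus a lowercase slug, which keeps directory listings sortable; the project names below are placeholders:
# Hypothetical convention: YYYY-MM-DD_project-slug
mkdir -p ~/Projects/Active/"$(date +%F)_website-redesign"
mv ~/Projects/Active/OldProjectName ~/Projects/Active/"$(date +%F)_old-project-name"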
Automation and Maintenance
Automated Organization:
# Daily organization cron job
0 2 * * * /home/user/bin/organize_data.sh
# Weekly duplicate cleanup
0 1 * * 0 fdupes -r -d /home/user/Documents
# Monthly archival
0 0 1 * * /home/user/bin/archive_old_files.sh
Monitoring and Alerts:
# Storage usage monitoring
if [ "$(df / | tail -1 | awk '{print $5}' | sed 's/%//')" -gt 85 ]; then
echo "Disk usage critical" | mail -s "Storage Alert" admin
fi
Security and Compliance
Data Protection:
Implement access controls
Use encryption for sensitive data
Regular security audits
Backup verification
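Backup verification can start as simply as confirming an archive is still readable; a minimal sketch, assuming the dated tar archives created earlier:
# List the archive's contents; a non-zero exit status signals corruption
tar -tzf /backup/documents_$(date +%Y%m%d).tar.gz > /dev/null && echo "archive OK"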
Compliance Requirements:
Document data handling procedures
Implement retention policies
Audit trail maintenance
Regular compliance reviews
# Set up secure data handling
# Encrypt sensitive directories
sudo apt install ecryptfs-utils
sudo ecryptfs-migrate-home -u username # must run as root while the target user is logged out
# Set appropriate permissions
find /home/user/Documents -type f -exec chmod 640 {} \;
find /home/user/Documents -type d -exec chmod 750 {} \;
Data Organization — Storage Systems and Data Management Guide 1.0 documentation
Data Organization — Storage Systems and Data Management Guide 1.0 documentation
Data Organization
Understanding Data Organization
Data organization refers to the systematic arrangement and structuring of data to optimize storage efficiency, access speed, and management ease. Proper data organization is crucial for system performance and data integrity.
Data Organization Fundamentals
What is Data Organization?
Data organization encompasses:
Data Structure: How data is arranged and stored
Data Classification: Categorizing data by type, importance, and access patterns
Data Hierarchy: Organizing data in logical levels and relationships
Data Access Patterns: Understanding how data is accessed and used
Data Lifecycle: Managing data from creation to deletion
Data Classification Types
By Access Frequency
Hot Data: Frequently accessed, requires fast storage (SSD)
Warm Data: Occasionally accessed, standard performance storage
Cold Data: Rarely accessed, slower but cost-effective storage
Archival Data: Long-term retention, very infrequent access
By Data Type
Structured Data: Databases, spreadsheets, organized formats
Semi-structured Data: JSON, XML, log files
Unstructured Data: Documents, images, videos, emails
By Criticality
Critical Data: Business-essential, requires high availability
Important Data: Significant but not critical
Standard Data: Regular business data
Temporary Data: Short-term, disposable data
Ubuntu 22.04 Data Organization Tools
Directory Structure Best Practices
# Standard Ubuntu directory structure
/
├── home/ # User data
├── var/ # Variable data (logs, databases)
├── usr/ # User programs and libraries
├── opt/ # Optional software packages
├── srv/ # Service data
├── tmp/ # Temporary files
└── mnt/ # Mount points for external storage
# Recommended user data organization
/home/username/
├── Documents/ # Personal documents
├── Projects/ # Development projects
├── Media/ # Photos, videos, music
├── Archive/ # Old/archived files
├── Backup/ # Local backups
└── Work/ # Work-related files
File Organization Commands
# Create organized directory structure
mkdir -p ~/Documents/{Personal,Work}/{2024,2023,Archive}
mkdir -p ~/Projects/{Active,Completed,Archive}
mkdir -p ~/Media/{Photos,Videos,Music}/{2024,2023,Archive}
# File organization by date
find ~/Downloads -type f -newermt "2024-01-01" ! -newermt "2024-12-31" \
-exec mkdir -p ~/Archive/2024/ \; -exec mv {} ~/Archive/2024/ \;
# Organize files by extension
cd ~/Downloads
for ext in pdf doc docx txt; do
mkdir -p ~/Documents/$(echo $ext | tr '[:lower:]' '[:upper:]')
find . -name "*.$ext" -exec mv {} ~/Documents/$(echo $ext | tr '[:lower:]' '[:upper:]')/ \;
done
# Find and organize duplicate files
fdupes -r ~/Documents -d
Data Backup and Archival
Backup Strategies
# 3-2-1 Backup Rule Implementation
# 3 copies, 2 different media types, 1 offsite
# Local backup (rsync)
rsync -avH --delete ~/Documents/ /backup/documents/
# External drive backup
rsync -avH --delete ~/Documents/ /media/external/backup/documents/
# Cloud backup (rclone)
rclone sync ~/Documents/ cloud:backup/documents/
# Compressed archive backup
tar -czf /backup/documents_$(date +%Y%m%d).tar.gz ~/Documents/
Automated Data Organization
# Create organization script
cat > ~/bin/organize_data.sh << 'EOF'
#!/bin/bash
# Organize downloads by file type
organize_downloads() {
cd ~/Downloads
# Create directories
mkdir -p {Documents,Images,Videos,Music,Archives,Software}
# Move files by type
find . -maxdepth 1 -type f \( -iname "*.pdf" -o -iname "*.doc" -o -iname "*.docx" -o -iname "*.txt" \) -exec mv {} Documents/ \;
find . -maxdepth 1 -type f \( -iname "*.jpg" -o -iname "*.png" -o -iname "*.gif" -o -iname "*.jpeg" \) -exec mv {} Images/ \;
find . -maxdepth 1 -type f \( -iname "*.mp4" -o -iname "*.avi" -o -iname "*.mkv" \) -exec mv {} Videos/ \;
find . -maxdepth 1 -type f \( -iname "*.mp3" -o -iname "*.flac" -o -iname "*.wav" \) -exec mv {} Music/ \;
find . -maxdepth 1 -type f \( -iname "*.zip" -o -iname "*.tar.gz" -o -iname "*.rar" \) -exec mv {} Archives/ \;
find . -maxdepth 1 -type f \( -iname "*.deb" -o -iname "*.appimage" \) -exec mv {} Software/ \;
}
# Clean old temporary files
clean_temp() {
find ~/Downloads -type f -mtime +30 -delete
find ~/.cache -type f -mtime +7 -delete
find /tmp -user $USER -type f -mtime +1 -delete 2>/dev/null
}
# Archive old files
archive_old() {
find ~/Documents -type f -mtime +365 -exec mkdir -p ~/Archive/$(date +%Y) \; -exec mv {} ~/Archive/$(date +%Y)/ \;
}
# Execute functions
organize_downloads
clean_temp
archive_old
echo "Data organization completed: $(date)"
EOF
chmod +x ~/bin/organize_data.sh
# Schedule with cron
echo "0 2 * * * ~/bin/organize_data.sh" | crontab -
Data Compression and Deduplication
File Compression Techniques
# Install compression tools
sudo apt install p7zip-full rar unrar gzip bzip2 xz-utils
# Create compressed archives
# gzip (fast, good compression)
tar -czf archive.tar.gz /path/to/data
# bzip2 (better compression, slower)
tar -cjf archive.tar.bz2 /path/to/data
# xz (best compression, slowest)
tar -cJf archive.tar.xz /path/to/data
# 7zip (excellent compression)
7z a -t7z -mx=9 archive.7z /path/to/data
# Compare compression ratios
for method in gz bz2 xz 7z; do
echo "Compressing with $method..."
case $method in
gz) tar -czf test.tar.gz ~/Documents ;;
bz2) tar -cjf test.tar.bz2 ~/Documents ;;
xz) tar -cJf test.tar.xz ~/Documents ;;
7z) 7z a test.7z ~/Documents ;;
esac
echo "$method: $(ls -lh test.*$method* | awk '{print $5}')"
done
Duplicate File Management
# Install deduplication tools
sudo apt install fdupes rdfind
# Find duplicates with fdupes
fdupes -r ~/Documents
# Find and delete duplicates interactively
fdupes -r -d ~/Documents
# Find duplicates with rdfind
rdfind ~/Documents
# Advanced duplicate handling with rdfind
rdfind -makehardlinks true ~/Documents
rdfind -makeresultsfile false -makehardlinks true ~/Documents
Data Monitoring and Analysis
Storage Usage Analysis
# Analyze disk usage by directory
ncdu /home
# Find largest files
find /home -type f -exec du -h {} + | sort -rh | head -20
# Analyze file types and sizes
find /home -type f | sed 's/.*\.//' | sort | uniq -c | sort -rn
# Monitor storage usage trends
cat > ~/bin/storage_trend.sh << 'EOF'
#!/bin/bash
LOGFILE="/var/log/storage_usage.log"
echo "$(date): $(df -h / | tail -1 | awk '{print $5}')" >> $LOGFILE
# Generate weekly report
if [ "$(date +%u)" -eq 1 ]; then
echo "Weekly Storage Report - $(date)" > /tmp/storage_report.txt
tail -7 $LOGFILE >> /tmp/storage_report.txt
mail -s "Storage Usage Report" admin@example.com < /tmp/storage_report.txt
fi
EOF
chmod +x ~/bin/storage_trend.sh
echo "0 0 * * * ~/bin/storage_trend.sh" | crontab -
File System Metadata Analysis
# Analyze inode usage
df -i
# Find directories with most files
find /home -type d -exec sh -c 'echo "$(find "$1" -maxdepth 1 | wc -l) $1"' _ {} \; | sort -rn | head -10
# Analyze file access patterns
find /home -type f -atime -7 | wc -l # Files accessed in last 7 days
find /home -type f -mtime -7 | wc -l # Files modified in last 7 days
find /home -type f -atime +365 | wc -l # Files not accessed in over a year
Frequently Asked Questions
Q: How should I organize my home directory in Ubuntu 22.04?
A: Follow this recommended structure:
~/
├── Documents/
│ ├── Personal/
│ ├── Work/
│ └── Archive/
├── Projects/
│ ├── Active/
│ ├── Completed/
│ └── Learning/
├── Media/
│ ├── Photos/
│ ├── Videos/
│ └── Music/
├── Downloads/
│ └── (temporary, clean regularly)
├── Backup/
│ └── (local backup copies)
└── Scripts/
└── (personal automation scripts)
Q: What’s the best way to handle duplicate files?
A: Use these strategies:
# 1. Find duplicates without deleting
fdupes -r ~/Documents
# 2. Interactive deletion
fdupes -r -d ~/Documents
# 3. Automatic deletion (keep first occurrence)
fdupes -r -f ~/Documents | grep -v '^$' | xargs rm
# 4. Convert to hard links (saves space)
rdfind -makehardlinks true ~/Documents
Q: How can I automate file organization?
A: Create automated organization scripts:
# File organizer by extension
#!/bin/bash
organize_by_extension() {
local source_dir="$1"
local target_dir="$2"
cd "$source_dir"
for file in *.*; do
if [ -f "$file" ]; then
ext="${file##*.}"
ext_upper=$(echo "$ext" | tr '[:lower:]' '[:upper:]')
mkdir -p "$target_dir/$ext_upper"
mv "$file" "$target_dir/$ext_upper/"
fi
done
}
# Usage
organize_by_extension ~/Downloads ~/Downloads/Organized
Q: How do I implement a data retention policy?
A: Create a retention management system:
#!/bin/bash
# data_retention.sh
# Define retention periods (days)
TEMP_RETENTION=7
LOG_RETENTION=90
BACKUP_RETENTION=365
ARCHIVE_RETENTION=2555 # 7 years
# Clean temporary files
find /tmp -type f -mtime +$TEMP_RETENTION -delete
find ~/.cache -type f -mtime +$TEMP_RETENTION -delete
# Archive old logs
find /var/log -name "*.log" -mtime +$LOG_RETENTION -gzip
# Remove old backups
find /backup -name "*.tar.gz" -mtime +$BACKUP_RETENTION -delete
# Alert for very old archives
find /archive -type f -mtime +$ARCHIVE_RETENTION -ls | \
mail -s "Files exceeding retention policy" admin@example.com
Coding Examples
Python Data Organization Manager
#!/usr/bin/env python3
"""
Advanced data organization manager for Ubuntu 22.04
"""
import os
import shutil
import hashlib
import json
import time
import mimetypes
from pathlib import Path
from collections import defaultdict
import argparse
class DataOrganizer:
def __init__(self, base_path=None):
self.base_path = Path(base_path) if base_path else Path.home()
self.file_types = {
'documents': ['.pdf', '.doc', '.docx', '.txt', '.rtf', '.odt'],
'images': ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.svg'],
'videos': ['.mp4', '.avi', '.mkv', '.mov', '.wmv', '.flv', '.webm'],
'music': ['.mp3', '.flac', '.wav', '.ogg', '.m4a', '.aac'],
'archives': ['.zip', '.tar', '.gz', '.bz2', '.xz', '.rar', '.7z'],
'code': ['.py', '.js', '.html', '.css', '.java', '.cpp', '.c', '.php'],
'data': ['.csv', '.json', '.xml', '.sql', '.db', '.sqlite']
}
self.stats = defaultdict(int)
def get_file_hash(self, filepath):
"""Calculate MD5 hash of a file"""
hash_md5 = hashlib.md5()
try:
with open(filepath, "rb") as f:
for chunk in iter(lambda: f.read(4096), b""):
hash_md5.update(chunk)
return hash_md5.hexdigest()
except (IOError, OSError):
return None
def find_duplicates(self, directory):
"""Find duplicate files in a directory"""
duplicates = defaultdict(list)
for root, dirs, files in os.walk(directory):
for file in files:
filepath = Path(root) / file
if filepath.is_file():
file_hash = self.get_file_hash(filepath)
if file_hash:
duplicates[file_hash].append(filepath)
# Filter out unique files
return {k: v for k, v in duplicates.items() if len(v) > 1}
def organize_by_type(self, source_dir, target_dir=None):
"""Organize files by type"""
source_path = Path(source_dir)
target_path = Path(target_dir) if target_dir else source_path / "Organized"
if not source_path.exists():
raise ValueError(f"Source directory {source_path} does not exist")
target_path.mkdir(exist_ok=True)
for file_path in source_path.rglob('*'):
if file_path.is_file() and file_path.parent == source_path:
file_ext = file_path.suffix.lower()
# Determine file category
category = 'misc'
for cat, extensions in self.file_types.items():
if file_ext in extensions:
category = cat
break
# Create category directory
category_dir = target_path / category.title()
category_dir.mkdir(exist_ok=True)
# Move file
new_path = category_dir / file_path.name
# Handle naming conflicts
counter = 1
original_name = new_path.stem
while new_path.exists():
new_path = category_dir / f"{original_name}_{counter}{file_path.suffix}"
counter += 1
try:
shutil.move(str(file_path), str(new_path))
self.stats[f'moved_{category}'] += 1
print(f"Moved {file_path.name} to {category.title()}/")
except (IOError, OSError) as e:
print(f"Error moving {file_path}: {e}")
def organize_by_date(self, source_dir, target_dir=None):
"""Organize files by modification date"""
source_path = Path(source_dir)
target_path = Path(target_dir) if target_dir else source_path / "ByDate"
target_path.mkdir(exist_ok=True)
for file_path in source_path.rglob('*'):
if file_path.is_file():
# Get modification time
mtime = file_path.stat().st_mtime
date_str = time.strftime('%Y/%m', time.localtime(mtime))
# Create date directory
date_dir = target_path / date_str
date_dir.mkdir(parents=True, exist_ok=True)
# Move file
new_path = date_dir / file_path.name
# Handle naming conflicts
counter = 1
original_name = new_path.stem
while new_path.exists():
new_path = date_dir / f"{original_name}_{counter}{file_path.suffix}"
counter += 1
try:
shutil.move(str(file_path), str(new_path))
self.stats['moved_by_date'] += 1
except (IOError, OSError) as e:
print(f"Error moving {file_path}: {e}")
def clean_empty_directories(self, directory):
"""Remove empty directories"""
removed_count = 0
for root, dirs, files in os.walk(directory, topdown=False):
for dir_name in dirs:
dir_path = Path(root) / dir_name
try:
if not any(dir_path.iterdir()):
dir_path.rmdir()
removed_count += 1
print(f"Removed empty directory: {dir_path}")
except OSError:
pass # Directory not empty or permission denied
self.stats['removed_empty_dirs'] = removed_count
def analyze_directory(self, directory):
"""Analyze directory structure and file distribution"""
analysis = {
'total_files': 0,
'total_size': 0,
'file_types': defaultdict(int),
'size_by_type': defaultdict(int),
'largest_files': [],
'oldest_files': [],
'newest_files': []
}
files_with_info = []
for root, dirs, files in os.walk(directory):
for file in files:
file_path = Path(root) / file
if file_path.is_file():
try:
stat_info = file_path.stat()
file_size = stat_info.st_size
file_ext = file_path.suffix.lower()
analysis['total_files'] += 1
analysis['total_size'] += file_size
analysis['file_types'][file_ext] += 1
analysis['size_by_type'][file_ext] += file_size
files_with_info.append({
'path': file_path,
'size': file_size,
'mtime': stat_info.st_mtime
})
except OSError:
continue
# Sort files for top lists
files_with_info.sort(key=lambda x: x['size'], reverse=True)
analysis['largest_files'] = [
{'path': str(f['path']), 'size_mb': f['size'] / (1024*1024)}
for f in files_with_info[:10]
]
files_with_info.sort(key=lambda x: x['mtime'])
analysis['oldest_files'] = [
{'path': str(f['path']), 'date': time.strftime('%Y-%m-%d', time.localtime(f['mtime']))}
for f in files_with_info[:10]
]
files_with_info.sort(key=lambda x: x['mtime'], reverse=True)
analysis['newest_files'] = [
{'path': str(f['path']), 'date': time.strftime('%Y-%m-%d', time.localtime(f['mtime']))}
for f in files_with_info[:10]
]
return analysis
def create_backup_structure(self, source_dir, backup_dir):
"""Create organized backup structure"""
source_path = Path(source_dir)
backup_path = Path(backup_dir)
timestamp = time.strftime('%Y%m%d_%H%M%S')
backup_target = backup_path / f"backup_{timestamp}"
backup_target.mkdir(parents=True, exist_ok=True)
# Copy files with organization
for category, extensions in self.file_types.items():
category_backup = backup_target / category
category_backup.mkdir(exist_ok=True)
for root, dirs, files in os.walk(source_path):
for file in files:
file_path = Path(root) / file
if file_path.suffix.lower() in extensions:
try:
shutil.copy2(file_path, category_backup)
self.stats[f'backed_up_{category}'] += 1
except (IOError, OSError) as e:
print(f"Error backing up {file_path}: {e}")
def generate_report(self):
"""Generate organization report"""
report = {
'timestamp': time.strftime('%Y-%m-%d %H:%M:%S'),
'statistics': dict(self.stats),
'summary': f"Processed {sum(self.stats.values())} operations"
}
return json.dumps(report, indent=2)
def main():
parser = argparse.ArgumentParser(description='Data Organization Manager')
parser.add_argument('command', choices=['organize', 'analyze', 'duplicates', 'backup', 'clean'])
parser.add_argument('source', help='Source directory')
parser.add_argument('--target', help='Target directory')
parser.add_argument('--by-type', action='store_true', help='Organize by file type')
parser.add_argument('--by-date', action='store_true', help='Organize by date')
args = parser.parse_args()
organizer = DataOrganizer()
if args.command == 'organize':
if args.by_type:
organizer.organize_by_type(args.source, args.target)
elif args.by_date:
organizer.organize_by_date(args.source, args.target)
else:
organizer.organize_by_type(args.source, args.target)
elif args.command == 'analyze':
analysis = organizer.analyze_directory(args.source)
print(json.dumps(analysis, indent=2, default=str))
elif args.command == 'duplicates':
duplicates = organizer.find_duplicates(args.source)
print(f"Found {len(duplicates)} sets of duplicate files:")
for hash_val, files in duplicates.items():
print(f"Hash {hash_val[:8]}...")
for file in files:
print(f" {file}")
elif args.command == 'backup':
if not args.target:
print("Target directory required for backup")
return
organizer.create_backup_structure(args.source, args.target)
elif args.command == 'clean':
organizer.clean_empty_directories(args.source)
print(organizer.generate_report())
if __name__ == "__main__":
main()
Bash File Organization System
#!/bin/bash
# advanced_file_organizer.sh - Advanced file organization system for Ubuntu 22.04
# Configuration
CONFIG_FILE="$HOME/.file_organizer.conf"
LOG_FILE="$HOME/.file_organizer.log"
# Default configuration
cat > "$CONFIG_FILE" << 'EOF'
# File Organization Configuration
ORGANIZE_DOWNLOADS=true
ORGANIZE_DESKTOP=true
AUTO_ARCHIVE_DAYS=365
DUPLICATE_ACTION=ask # ask, delete, link
BACKUP_ENABLED=true
BACKUP_DIR="$HOME/Backup"
# File type definitions
DOCUMENTS="pdf doc docx txt rtf odt"
IMAGES="jpg jpeg png gif bmp tiff svg"
VIDEOS="mp4 avi mkv mov wmv flv webm"
MUSIC="mp3 flac wav ogg m4a aac"
ARCHIVES="zip tar gz bz2 xz rar 7z"
CODE="py js html css java cpp c php"
DATA="csv json xml sql db sqlite"
EOF
# Load configuration
source "$CONFIG_FILE"
# Logging function
log_message() {
echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
echo "$1"
}
# Create directory structure
create_structure() {
local base_dir="$1"
log_message "Creating directory structure in $base_dir"
mkdir -p "$base_dir"/{Documents,Images,Videos,Music,Archives,Code,Data,Misc}
mkdir -p "$base_dir"/Archive/{$(date +%Y),$(date -d 'last year' +%Y)}
mkdir -p "$base_dir"/Backup
log_message "Directory structure created"
}
# Get file category
get_file_category() {
local file="$1"
local extension="${file##*.}"
extension=$(echo "$extension" | tr '[:upper:]' '[:lower:]')
case " $DOCUMENTS " in *" $extension "*) echo "Documents"; return ;; esac
case " $IMAGES " in *" $extension "*) echo "Images"; return ;; esac
case " $VIDEOS " in *" $extension "*) echo "Videos"; return ;; esac
case " $MUSIC " in *" $extension "*) echo "Music"; return ;; esac
case " $ARCHIVES " in *" $extension "*) echo "Archives"; return ;; esac
case " $CODE " in *" $extension "*) echo "Code"; return ;; esac
case " $DATA " in *" $extension "*) echo "Data"; return ;; esac
echo "Misc"
}
# Organize files by type
organize_by_type() {
local source_dir="$1"
local target_dir="$2"
log_message "Organizing files from $source_dir to $target_dir"
create_structure "$target_dir"
local moved_count=0
find "$source_dir" -maxdepth 1 -type f | while read -r file; do
if [[ -f "$file" ]]; then
local filename=$(basename "$file")
local category=$(get_file_category "$filename")
local target_file="$target_dir/$category/$filename"
# Handle name conflicts
local counter=1
local base_name="${filename%.*}"
local extension="${filename##*.}"
while [[ -e "$target_file" ]]; do
if [[ "$extension" != "$filename" ]]; then
target_file="$target_dir/$category/${base_name}_${counter}.${extension}"
else
target_file="$target_dir/$category/${filename}_${counter}"
fi
((counter++))
done
if mv "$file" "$target_file"; then
log_message "Moved $filename to $category/"
((moved_count++))
else
log_message "Failed to move $filename"
fi
fi
done
log_message "Organized $moved_count files"
}
# Find and handle duplicates
handle_duplicates() {
local directory="$1"
local action="${2:-$DUPLICATE_ACTION}"
log_message "Finding duplicates in $directory"
if ! command -v fdupes &> /dev/null; then
log_message "Installing fdupes for duplicate detection"
sudo apt update && sudo apt install -y fdupes
fi
local duplicates_file="/tmp/duplicates_$$"
fdupes -r "$directory" > "$duplicates_file"
local duplicate_sets=0
local files_processed=0
while IFS= read -r line; do
if [[ -z "$line" ]]; then
((duplicate_sets++))
elif [[ -f "$line" ]]; then
case "$action" in
"delete")
if [[ $files_processed -gt 0 ]]; then
rm "$line"
log_message "Deleted duplicate: $line"
fi
;;
"link")
if [[ $files_processed -gt 0 ]]; then
local first_file=$(head -1 "$duplicates_file")
rm "$line"
ln "$first_file" "$line"
log_message "Linked duplicate: $line"
fi
;;
"ask")
if [[ $files_processed -gt 0 ]]; then
echo "Duplicate found: $line"
read -p "Delete this file? (y/N): " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
rm "$line"
log_message "User deleted duplicate: $line"
fi
fi
;;
esac
((files_processed++))
fi
done < "$duplicates_file"
rm "$duplicates_file"
log_message "Processed $duplicate_sets sets of duplicates"
}
# Archive old files
archive_old_files() {
local source_dir="$1"
local days="${2:-$AUTO_ARCHIVE_DAYS}"
log_message "Archiving files older than $days days from $source_dir"
local archive_dir="$source_dir/Archive/$(date +%Y)"
mkdir -p "$archive_dir"
local archived_count=0
find "$source_dir" -type f -mtime +$days ! -path "*/Archive/*" ! -path "*/Backup/*" | while read -r file; do
local relative_path="${file#$source_dir/}"
local archive_path="$archive_dir/$relative_path"
local archive_parent=$(dirname "$archive_path")
mkdir -p "$archive_parent"
if mv "$file" "$archive_path"; then
log_message "Archived: $relative_path"
((archived_count++))
fi
done
log_message "Archived $archived_count files"
}
# Create backup
create_backup() {
local source_dir="$1"
local backup_dir="${2:-$BACKUP_DIR}"
if [[ "$BACKUP_ENABLED" != "true" ]]; then
log_message "Backup disabled in configuration"
return
fi
log_message "Creating backup of $source_dir to $backup_dir"
local timestamp=$(date +%Y%m%d_%H%M%S)
local backup_target="$backup_dir/backup_$timestamp"
mkdir -p "$backup_target"
if rsync -av --progress "$source_dir/" "$backup_target/"; then
log_message "Backup created successfully: $backup_target"
# Compress backup
if command -v tar &> /dev/null; then
tar -czf "$backup_target.tar.gz" -C "$backup_dir" "backup_$timestamp"
rm -rf "$backup_target"
log_message "Backup compressed: $backup_target.tar.gz"
fi
else
log_message "Backup failed"
fi
}
# Generate organization report
generate_report() {
local directory="$1"
local report_file="/tmp/organization_report_$(date +%Y%m%d).txt"
{
echo "File Organization Report"
echo "========================"
echo "Generated: $(date)"
echo "Directory: $directory"
echo ""
echo "Directory Structure:"
tree -d -L 2 "$directory" 2>/dev/null || find "$directory" -type d | head -20
echo ""
echo "File Count by Type:"
for type in Documents Images Videos Music Archives Code Data Misc; do
if [[ -d "$directory/$type" ]]; then
count=$(find "$directory/$type" -type f | wc -l)
echo "$type: $count files"
fi
done
echo ""
echo "Storage Usage:"
du -sh "$directory"/* 2>/dev/null | sort -hr
echo ""
echo "Recent Activity (from log):"
tail -20 "$LOG_FILE"
} > "$report_file"
echo "Report generated: $report_file"
# Email report if mail is configured
if command -v mail &> /dev/null; then
cat "$report_file" | mail -s "File Organization Report" "$USER@localhost"
fi
}
# Main execution
main() {
case "${1:-help}" in
"organize")
organize_by_type "${2:-$HOME/Downloads}" "${3:-$HOME/Organized}"
;;
"duplicates")
handle_duplicates "${2:-$HOME}" "${3:-ask}"
;;
"archive")
archive_old_files "${2:-$HOME}" "${3:-365}"
;;
"backup")
create_backup "${2:-$HOME}" "${3:-$BACKUP_DIR}"
;;
"report")
generate_report "${2:-$HOME}"
;;
"full")
# Full organization workflow
local target_dir="${2:-$HOME/Organized}"
log_message "Starting full organization workflow"
create_backup "$HOME/Downloads"
organize_by_type "$HOME/Downloads" "$target_dir"
handle_duplicates "$target_dir" "ask"
archive_old_files "$target_dir"
generate_report "$target_dir"
log_message "Full organization workflow completed"
;;
"config")
nano "$CONFIG_FILE"
;;
"help"|*)
echo "Advanced File Organizer for Ubuntu 22.04"
echo "Usage: $0 [command] [options]"
echo ""
echo "Commands:"
echo " organize [source] [target] - Organize files by type"
echo " duplicates [dir] [action] - Handle duplicate files (ask/delete/link)"
echo " archive [dir] [days] - Archive old files"
echo " backup [source] [target] - Create backup"
echo " report [directory] - Generate organization report"
echo " full [target] - Run complete organization workflow"
echo " config - Edit configuration"
echo ""
echo "Examples:"
echo " $0 organize ~/Downloads ~/Organized"
echo " $0 duplicates ~/Documents delete"
echo " $0 archive ~/Documents 180"
echo " $0 full ~/MyFiles"
;;
esac
}
# Run main function with all arguments
main "$@"
Best Practices for Data Organization
Strategic Planning
Data Classification Strategy:
Identify data types and access patterns
Define retention policies
Establish backup requirements
Plan for data growth
Directory Structure Design:
Use consistent naming conventions
Implement logical hierarchies
Plan for scalability
Document organization rules
Automation and Maintenance
Automated Organization:
# Daily organization cron job
0 2 * * * /home/user/bin/organize_data.sh
# Weekly duplicate cleanup
0 1 * * 0 fdupes -r -d /home/user/Documents
# Monthly archival
0 0 1 * * /home/user/bin/archive_old_files.sh
Monitoring and Alerts:
# Storage usage monitoring
if [ $(df / | tail -1 | awk '{print $5}' | sed 's/%//') -gt 85 ]; then
echo "Disk usage critical" | mail -s "Storage Alert" admin
fi
Security and Compliance
Data Protection:
Implement access controls
Use encryption for sensitive data
Regular security audits
Backup verification
Compliance Requirements:
Document data handling procedures
Implement retention policies
Audit trail maintenance
Regular compliance reviews
# Set up secure data handling
# Encrypt sensitive directories
sudo apt install ecryptfs-utils
ecryptfs-migrate-home -u username
# Set appropriate permissions
find /home/user/Documents -type f -exec chmod 640 {} \;
find /home/user/Documents -type d -exec chmod 750 {} \;
Data Organization — Storage Systems and Data Management Guide 1.0 documentation
Data Organization
Understanding Data Organization
Data organization refers to the systematic arrangement and structuring of data to optimize storage efficiency, access speed, and management ease. Proper data organization is crucial for system performance and data integrity.
Data Organization Fundamentals
What is Data Organization?
Data organization encompasses:
Data Structure: How data is arranged and stored
Data Classification: Categorizing data by type, importance, and access patterns
Data Hierarchy: Organizing data in logical levels and relationships
Data Access Patterns: Understanding how data is accessed and used
Data Lifecycle: Managing data from creation to deletion
Data Classification Types
By Access Frequency
Hot Data: Frequently accessed, requires fast storage (SSD)
Warm Data: Occasionally accessed, standard performance storage
Cold Data: Rarely accessed, slower but cost-effective storage
Archival Data: Long-term retention, very infrequent access
By Data Type
Structured Data: Databases, spreadsheets, organized formats
Semi-structured Data: JSON, XML, log files
Unstructured Data: Documents, images, videos, emails
By Criticality
Critical Data: Business-essential, requires high availability
Important Data: Significant but not critical
Standard Data: Regular business data
Temporary Data: Short-term, disposable data
Ubuntu 22.04 Data Organization Tools
Directory Structure Best Practices
# Standard Ubuntu directory structure
/
├── home/ # User data
├── var/ # Variable data (logs, databases)
├── usr/ # User programs and libraries
├── opt/ # Optional software packages
├── srv/ # Service data
├── tmp/ # Temporary files
└── mnt/ # Mount points for external storage
# Recommended user data organization
/home/username/
├── Documents/ # Personal documents
├── Projects/ # Development projects
├── Media/ # Photos, videos, music
├── Archive/ # Old/archived files
├── Backup/ # Local backups
└── Work/ # Work-related files
File Organization Commands
# Create organized directory structure
mkdir -p ~/Documents/{Personal,Work}/{2024,2023,Archive}
mkdir -p ~/Projects/{Active,Completed,Archive}
mkdir -p ~/Media/{Photos,Videos,Music}/{2024,2023,Archive}
# File organization by date
find ~/Downloads -type f -newermt "2024-01-01" ! -newermt "2024-12-31" \
-exec mkdir -p ~/Archive/2024/ \; -exec mv {} ~/Archive/2024/ \;
# Organize files by extension
cd ~/Downloads
for ext in pdf doc docx txt; do
mkdir -p ~/Documents/$(echo $ext | tr '[:lower:]' '[:upper:]')
find . -name "*.$ext" -exec mv {} ~/Documents/$(echo $ext | tr '[:lower:]' '[:upper:]')/ \;
done
# Find and organize duplicate files
fdupes -r ~/Documents -d
Data Backup and Archival
Backup Strategies
# 3-2-1 Backup Rule Implementation
# 3 copies, 2 different media types, 1 offsite
# Local backup (rsync)
rsync -avH --delete ~/Documents/ /backup/documents/
# External drive backup
rsync -avH --delete ~/Documents/ /media/external/backup/documents/
# Cloud backup (rclone)
rclone sync ~/Documents/ cloud:backup/documents/
# Compressed archive backup
tar -czf /backup/documents_$(date +%Y%m%d).tar.gz ~/Documents/
Automated Data Organization
# Create organization script
cat > ~/bin/organize_data.sh << 'EOF'
#!/bin/bash
# Organize downloads by file type
organize_downloads() {
cd ~/Downloads
# Create directories
mkdir -p {Documents,Images,Videos,Music,Archives,Software}
# Move files by type
find . -maxdepth 1 -type f \( -iname "*.pdf" -o -iname "*.doc" -o -iname "*.docx" -o -iname "*.txt" \) -exec mv {} Documents/ \;
find . -maxdepth 1 -type f \( -iname "*.jpg" -o -iname "*.png" -o -iname "*.gif" -o -iname "*.jpeg" \) -exec mv {} Images/ \;
find . -maxdepth 1 -type f \( -iname "*.mp4" -o -iname "*.avi" -o -iname "*.mkv" \) -exec mv {} Videos/ \;
find . -maxdepth 1 -type f \( -iname "*.mp3" -o -iname "*.flac" -o -iname "*.wav" \) -exec mv {} Music/ \;
find . -maxdepth 1 -type f \( -iname "*.zip" -o -iname "*.tar.gz" -o -iname "*.rar" \) -exec mv {} Archives/ \;
find . -maxdepth 1 -type f \( -iname "*.deb" -o -iname "*.appimage" \) -exec mv {} Software/ \;
}
# Clean old temporary files
clean_temp() {
find ~/Downloads -type f -mtime +30 -delete
find ~/.cache -type f -mtime +7 -delete
find /tmp -user $USER -type f -mtime +1 -delete 2>/dev/null
}
# Archive old files
archive_old() {
local year_dir=~/Archive/$(date +%Y)
mkdir -p "$year_dir"
find ~/Documents -type f -mtime +365 -exec mv -t "$year_dir" {} +
}
# Execute functions
organize_downloads
clean_temp
archive_old
echo "Data organization completed: $(date)"
EOF
chmod +x ~/bin/organize_data.sh
# Schedule with cron (append to the existing crontab instead of replacing it)
(crontab -l 2>/dev/null; echo "0 2 * * * ~/bin/organize_data.sh") | crontab -
Data Compression and Deduplication
File Compression Techniques
# Install compression tools
# rar and unrar come from the multiverse repository
sudo apt install p7zip-full rar unrar gzip bzip2 xz-utils
# Create compressed archives
# gzip (fast, good compression)
tar -czf archive.tar.gz /path/to/data
# bzip2 (better compression, slower)
tar -cjf archive.tar.bz2 /path/to/data
# xz (best compression, slowest)
tar -cJf archive.tar.xz /path/to/data
# 7zip (excellent compression)
7z a -t7z -mx=9 archive.7z /path/to/data
# Compare compression ratios
for method in gz bz2 xz 7z; do
echo "Compressing with $method..."
case $method in
gz) tar -czf test.tar.gz ~/Documents ;;
bz2) tar -cjf test.tar.bz2 ~/Documents ;;
xz) tar -cJf test.tar.xz ~/Documents ;;
7z) 7z a test.7z ~/Documents ;;
esac
echo "$method: $(ls -lh test.*$method* | awk '{print $5}')"
done
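Ubuntu 22.04 also ships zstd, which tends to compress nearly as well as xz in a fraction of the time; a brief sketch (compression level 19 is just an example):
# zstd (fast, strong compression; GNU tar has native support)
sudo apt install zstd
tar --zstd -cf archive.tar.zst /path/to/data
# Raise the compression level through the zstd environment variable
ZSTD_CLEVEL=19 tar --zstd -cf archive.tar.zst /path/to/data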
Duplicate File Management
# Install deduplication tools
sudo apt install fdupes rdfind
# Find duplicates with fdupes
fdupes -r ~/Documents
# Find and delete duplicates interactively
fdupes -r -d ~/Documents
# Find duplicates with rdfind
rdfind ~/Documents
# Advanced duplicate handling with rdfind
rdfind -makehardlinks true ~/Documents
rdfind -makeresultsfile false -makehardlinks true ~/Documents
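To confirm how much rdfind actually consolidated, link counts and inodes can be inspected; a short sketch (the two file names are invented for illustration):
# List files that now share data with at least one other path
find ~/Documents -type f -links +1 -printf '%n %p\n' | sort -rn | head
# Two hardlinked paths report the same inode number
stat -c '%h links, inode %i: %n' ~/Documents/a.txt ~/Documents/copy_of_a.txt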
Data Monitoring and Analysis
Storage Usage Analysis
# Analyze disk usage by directory
ncdu /home
# Find largest files
find /home -type f -exec du -h {} + | sort -rh | head -20
# Analyze file types and sizes
find /home -type f -name "*.*" | sed 's/.*\.//' | sort | uniq -c | sort -rn
# Monitor storage usage trends
cat > ~/bin/storage_trend.sh << 'EOF'
#!/bin/bash
LOGFILE="$HOME/.storage_usage.log" # /var/log is not writable from a user cron job
echo "$(date): $(df -h / | tail -1 | awk '{print $5}')" >> "$LOGFILE"
# Generate weekly report
if [ "$(date +%u)" -eq 1 ]; then
echo "Weekly Storage Report - $(date)" > /tmp/storage_report.txt
tail -7 $LOGFILE >> /tmp/storage_report.txt
mail -s "Storage Usage Report" admin@example.com < /tmp/storage_report.txt
fi
EOF
chmod +x ~/bin/storage_trend.sh
echo "0 0 * * * ~/bin/storage_trend.sh" | crontab -
File System Metadata Analysis
# Analyze inode usage
df -i
# Find directories with most files
find /home -type d -exec sh -c 'echo "$(find "$1" -maxdepth 1 | wc -l) $1"' _ {} \; | sort -rn | head -10
# Analyze file access patterns
find /home -type f -atime -7 | wc -l # Files accessed in last 7 days
find /home -type f -mtime -7 | wc -l # Files modified in last 7 days
find /home -type f -atime +365 | wc -l # Files not accessed in over a year
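One caveat: Ubuntu mounts filesystems with relatime by default, so atime is only refreshed when a file is read after being modified or when the stored atime is more than a day old, which makes access counts approximate (mtime-based queries remain reliable). To inspect the options and timestamps in effect:
# Show the mount options for the root filesystem (look for relatime)
findmnt -no OPTIONS /
# Print access/modify/change times for one file
stat -c 'atime: %x%nmtime: %y%nctime: %z' "$HOME/.bashrc"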
Frequently Asked Questions
Q: How should I organize my home directory in Ubuntu 22.04?
A: Follow this recommended structure:
~/
├── Documents/
│ ├── Personal/
│ ├── Work/
│ └── Archive/
├── Projects/
│ ├── Active/
│ ├── Completed/
│ └── Learning/
├── Media/
│ ├── Photos/
│ ├── Videos/
│ └── Music/
├── Downloads/
│ └── (temporary, clean regularly)
├── Backup/
│ └── (local backup copies)
└── Scripts/
└── (personal automation scripts)
Q: What’s the best way to handle duplicate files?
A: Use these strategies:
# 1. Find duplicates without deleting
fdupes -r ~/Documents
# 2. Interactive deletion
fdupes -r -d ~/Documents
# 3. Automatic deletion (keep first occurrence)
fdupes -r -f ~/Documents | grep -v '^$' | xargs -d '\n' rm --
# 4. Convert to hard links (saves space)
rdfind -makehardlinks true ~/Documents
Q: How can I automate file organization?
A: Create automated organization scripts:
# File organizer by extension
#!/bin/bash
organize_by_extension() {
local source_dir="$1"
local target_dir="$2"
cd "$source_dir"
for file in *.*; do
if [ -f "$file" ]; then
ext="${file##*.}"
ext_upper=$(echo "$ext" | tr '[:lower:]' '[:upper:]')
mkdir -p "$target_dir/$ext_upper"
mv "$file" "$target_dir/$ext_upper/"
fi
done
}
# Usage
organize_by_extension ~/Downloads ~/Downloads/Organized
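For continuous rather than scheduled organization, the Downloads folder can be watched with inotify; a rough sketch reusing the organize_by_extension function above (requires the inotify-tools package):
sudo apt install inotify-tools
# Re-run the organizer whenever a download finishes or a file is moved in
inotifywait -m -e close_write,moved_to --format '%w%f' ~/Downloads | \
while read -r file; do
organize_by_extension ~/Downloads ~/Downloads/Organized
done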
Q: How do I implement a data retention policy?
A: Create a retention management system:
#!/bin/bash
# data_retention.sh
# Define retention periods (days)
TEMP_RETENTION=7
LOG_RETENTION=90
BACKUP_RETENTION=365
ARCHIVE_RETENTION=2555 # 7 years
# Clean temporary files
find /tmp -type f -mtime +$TEMP_RETENTION -delete
find ~/.cache -type f -mtime +$TEMP_RETENTION -delete
# Archive old logs
find /var/log -name "*.log" -mtime +$LOG_RETENTION -gzip
# Remove old backups
find /backup -name "*.tar.gz" -mtime +$BACKUP_RETENTION -delete
# Alert for very old archives
find /archive -type f -mtime +$ARCHIVE_RETENTION -ls | \
mail -s "Files exceeding retention policy" admin@example.com
Coding Examples
Python Data Organization Manager
#!/usr/bin/env python3
"""
Advanced data organization manager for Ubuntu 22.04
"""
import os
import shutil
import hashlib
import json
import time
from pathlib import Path
from collections import defaultdict
import argparse


class DataOrganizer:
    def __init__(self, base_path=None):
        self.base_path = Path(base_path) if base_path else Path.home()
        self.file_types = {
            'documents': ['.pdf', '.doc', '.docx', '.txt', '.rtf', '.odt'],
            'images': ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.svg'],
            'videos': ['.mp4', '.avi', '.mkv', '.mov', '.wmv', '.flv', '.webm'],
            'music': ['.mp3', '.flac', '.wav', '.ogg', '.m4a', '.aac'],
            'archives': ['.zip', '.tar', '.gz', '.bz2', '.xz', '.rar', '.7z'],
            'code': ['.py', '.js', '.html', '.css', '.java', '.cpp', '.c', '.php'],
            'data': ['.csv', '.json', '.xml', '.sql', '.db', '.sqlite']
        }
        self.stats = defaultdict(int)

    def get_file_hash(self, filepath):
        """Calculate MD5 hash of a file"""
        hash_md5 = hashlib.md5()
        try:
            with open(filepath, "rb") as f:
                for chunk in iter(lambda: f.read(4096), b""):
                    hash_md5.update(chunk)
            return hash_md5.hexdigest()
        except (IOError, OSError):
            return None

    def find_duplicates(self, directory):
        """Find duplicate files in a directory"""
        duplicates = defaultdict(list)
        for root, dirs, files in os.walk(directory):
            for file in files:
                filepath = Path(root) / file
                if filepath.is_file():
                    file_hash = self.get_file_hash(filepath)
                    if file_hash:
                        duplicates[file_hash].append(filepath)
        # Filter out unique files
        return {k: v for k, v in duplicates.items() if len(v) > 1}

    def organize_by_type(self, source_dir, target_dir=None):
        """Organize files by type"""
        source_path = Path(source_dir)
        target_path = Path(target_dir) if target_dir else source_path / "Organized"
        if not source_path.exists():
            raise ValueError(f"Source directory {source_path} does not exist")
        target_path.mkdir(parents=True, exist_ok=True)
        for file_path in source_path.rglob('*'):
            if file_path.is_file() and file_path.parent == source_path:
                file_ext = file_path.suffix.lower()
                # Determine file category
                category = 'misc'
                for cat, extensions in self.file_types.items():
                    if file_ext in extensions:
                        category = cat
                        break
                # Create category directory
                category_dir = target_path / category.title()
                category_dir.mkdir(exist_ok=True)
                # Move file, handling naming conflicts
                new_path = category_dir / file_path.name
                counter = 1
                original_name = new_path.stem
                while new_path.exists():
                    new_path = category_dir / f"{original_name}_{counter}{file_path.suffix}"
                    counter += 1
                try:
                    shutil.move(str(file_path), str(new_path))
                    self.stats[f'moved_{category}'] += 1
                    print(f"Moved {file_path.name} to {category.title()}/")
                except (IOError, OSError) as e:
                    print(f"Error moving {file_path}: {e}")

    def organize_by_date(self, source_dir, target_dir=None):
        """Organize files by modification date"""
        source_path = Path(source_dir)
        target_path = Path(target_dir) if target_dir else source_path / "ByDate"
        target_path.mkdir(parents=True, exist_ok=True)
        for file_path in source_path.rglob('*'):
            if file_path.is_file():
                # Get modification time
                mtime = file_path.stat().st_mtime
                date_str = time.strftime('%Y/%m', time.localtime(mtime))
                # Create date directory
                date_dir = target_path / date_str
                date_dir.mkdir(parents=True, exist_ok=True)
                # Move file, handling naming conflicts
                new_path = date_dir / file_path.name
                counter = 1
                original_name = new_path.stem
                while new_path.exists():
                    new_path = date_dir / f"{original_name}_{counter}{file_path.suffix}"
                    counter += 1
                try:
                    shutil.move(str(file_path), str(new_path))
                    self.stats['moved_by_date'] += 1
                except (IOError, OSError) as e:
                    print(f"Error moving {file_path}: {e}")

    def clean_empty_directories(self, directory):
        """Remove empty directories"""
        removed_count = 0
        for root, dirs, files in os.walk(directory, topdown=False):
            for dir_name in dirs:
                dir_path = Path(root) / dir_name
                try:
                    if not any(dir_path.iterdir()):
                        dir_path.rmdir()
                        removed_count += 1
                        print(f"Removed empty directory: {dir_path}")
                except OSError:
                    pass  # Directory not empty or permission denied
        self.stats['removed_empty_dirs'] = removed_count

    def analyze_directory(self, directory):
        """Analyze directory structure and file distribution"""
        analysis = {
            'total_files': 0,
            'total_size': 0,
            'file_types': defaultdict(int),
            'size_by_type': defaultdict(int),
            'largest_files': [],
            'oldest_files': [],
            'newest_files': []
        }
        files_with_info = []
        for root, dirs, files in os.walk(directory):
            for file in files:
                file_path = Path(root) / file
                if file_path.is_file():
                    try:
                        stat_info = file_path.stat()
                        file_size = stat_info.st_size
                        file_ext = file_path.suffix.lower()
                        analysis['total_files'] += 1
                        analysis['total_size'] += file_size
                        analysis['file_types'][file_ext] += 1
                        analysis['size_by_type'][file_ext] += file_size
                        files_with_info.append({
                            'path': file_path,
                            'size': file_size,
                            'mtime': stat_info.st_mtime
                        })
                    except OSError:
                        continue
        # Sort files for top lists
        files_with_info.sort(key=lambda x: x['size'], reverse=True)
        analysis['largest_files'] = [
            {'path': str(f['path']), 'size_mb': f['size'] / (1024 * 1024)}
            for f in files_with_info[:10]
        ]
        files_with_info.sort(key=lambda x: x['mtime'])
        analysis['oldest_files'] = [
            {'path': str(f['path']), 'date': time.strftime('%Y-%m-%d', time.localtime(f['mtime']))}
            for f in files_with_info[:10]
        ]
        files_with_info.sort(key=lambda x: x['mtime'], reverse=True)
        analysis['newest_files'] = [
            {'path': str(f['path']), 'date': time.strftime('%Y-%m-%d', time.localtime(f['mtime']))}
            for f in files_with_info[:10]
        ]
        return analysis

    def create_backup_structure(self, source_dir, backup_dir):
        """Create organized backup structure"""
        source_path = Path(source_dir)
        backup_path = Path(backup_dir)
        timestamp = time.strftime('%Y%m%d_%H%M%S')
        backup_target = backup_path / f"backup_{timestamp}"
        backup_target.mkdir(parents=True, exist_ok=True)
        # Copy files with organization
        for category, extensions in self.file_types.items():
            category_backup = backup_target / category
            category_backup.mkdir(exist_ok=True)
            for root, dirs, files in os.walk(source_path):
                for file in files:
                    file_path = Path(root) / file
                    if file_path.suffix.lower() in extensions:
                        try:
                            shutil.copy2(file_path, category_backup)
                            self.stats[f'backed_up_{category}'] += 1
                        except (IOError, OSError) as e:
                            print(f"Error backing up {file_path}: {e}")

    def generate_report(self):
        """Generate organization report"""
        report = {
            'timestamp': time.strftime('%Y-%m-%d %H:%M:%S'),
            'statistics': dict(self.stats),
            'summary': f"Processed {sum(self.stats.values())} operations"
        }
        return json.dumps(report, indent=2)


def main():
    parser = argparse.ArgumentParser(description='Data Organization Manager')
    parser.add_argument('command', choices=['organize', 'analyze', 'duplicates', 'backup', 'clean'])
    parser.add_argument('source', help='Source directory')
    parser.add_argument('--target', help='Target directory')
    parser.add_argument('--by-type', action='store_true', help='Organize by file type')
    parser.add_argument('--by-date', action='store_true', help='Organize by date')
    args = parser.parse_args()
    organizer = DataOrganizer()
    if args.command == 'organize':
        if args.by_date:
            organizer.organize_by_date(args.source, args.target)
        else:
            organizer.organize_by_type(args.source, args.target)
    elif args.command == 'analyze':
        analysis = organizer.analyze_directory(args.source)
        print(json.dumps(analysis, indent=2, default=str))
    elif args.command == 'duplicates':
        duplicates = organizer.find_duplicates(args.source)
        print(f"Found {len(duplicates)} sets of duplicate files:")
        for hash_val, files in duplicates.items():
            print(f"Hash {hash_val[:8]}...")
            for file in files:
                print(f"  {file}")
    elif args.command == 'backup':
        if not args.target:
            print("Target directory required for backup")
            return
        organizer.create_backup_structure(args.source, args.target)
    elif args.command == 'clean':
        organizer.clean_empty_directories(args.source)
    print(organizer.generate_report())


if __name__ == "__main__":
    main()
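Assuming the program above is saved as data_organizer.py (the filename is an assumption), typical invocations might look like:
python3 data_organizer.py organize ~/Downloads --target ~/Organized --by-type
python3 data_organizer.py organize ~/Pictures --by-date
python3 data_organizer.py analyze ~/Documents > analysis.json
python3 data_organizer.py duplicates ~/Pictures
python3 data_organizer.py backup ~/Documents --target /backup
python3 data_organizer.py clean ~/Organized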
Bash File Organization System
#!/bin/bash
# advanced_file_organizer.sh - Advanced file organization system for Ubuntu 22.04
# Configuration
CONFIG_FILE="$HOME/.file_organizer.conf"
LOG_FILE="$HOME/.file_organizer.log"
# Default configuration (written only on first run so later edits persist)
if [[ ! -f "$CONFIG_FILE" ]]; then
cat > "$CONFIG_FILE" << 'EOF'
# File Organization Configuration
ORGANIZE_DOWNLOADS=true
ORGANIZE_DESKTOP=true
AUTO_ARCHIVE_DAYS=365
DUPLICATE_ACTION=ask # ask, delete, link
BACKUP_ENABLED=true
BACKUP_DIR="$HOME/Backup"
# File type definitions
DOCUMENTS="pdf doc docx txt rtf odt"
IMAGES="jpg jpeg png gif bmp tiff svg"
VIDEOS="mp4 avi mkv mov wmv flv webm"
MUSIC="mp3 flac wav ogg m4a aac"
ARCHIVES="zip tar gz bz2 xz rar 7z"
CODE="py js html css java cpp c php"
DATA="csv json xml sql db sqlite"
EOF
fi
# Load configuration
source "$CONFIG_FILE"
# Logging function
log_message() {
echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
echo "$1"
}
# Create directory structure
create_structure() {
local base_dir="$1"
log_message "Creating directory structure in $base_dir"
mkdir -p "$base_dir"/{Documents,Images,Videos,Music,Archives,Code,Data,Misc}
mkdir -p "$base_dir"/Archive/{$(date +%Y),$(date -d 'last year' +%Y)}
mkdir -p "$base_dir"/Backup
log_message "Directory structure created"
}
# Get file category
get_file_category() {
local file="$1"
local extension="${file##*.}"
extension=$(echo "$extension" | tr '[:upper:]' '[:lower:]')
case " $DOCUMENTS " in *" $extension "*) echo "Documents"; return ;; esac
case " $IMAGES " in *" $extension "*) echo "Images"; return ;; esac
case " $VIDEOS " in *" $extension "*) echo "Videos"; return ;; esac
case " $MUSIC " in *" $extension "*) echo "Music"; return ;; esac
case " $ARCHIVES " in *" $extension "*) echo "Archives"; return ;; esac
case " $CODE " in *" $extension "*) echo "Code"; return ;; esac
case " $DATA " in *" $extension "*) echo "Data"; return ;; esac
echo "Misc"
}
# Organize files by type
organize_by_type() {
local source_dir="$1"
local target_dir="$2"
log_message "Organizing files from $source_dir to $target_dir"
create_structure "$target_dir"
local moved_count=0
find "$source_dir" -maxdepth 1 -type f | while read -r file; do
if [[ -f "$file" ]]; then
local filename=$(basename "$file")
local category=$(get_file_category "$filename")
local target_file="$target_dir/$category/$filename"
# Handle name conflicts
local counter=1
local base_name="${filename%.*}"
local extension="${filename##*.}"
while [[ -e "$target_file" ]]; do
if [[ "$extension" != "$filename" ]]; then
target_file="$target_dir/$category/${base_name}_${counter}.${extension}"
else
target_file="$target_dir/$category/${filename}_${counter}"
fi
((counter++))
done
if mv "$file" "$target_file"; then
log_message "Moved $filename to $category/"
((moved_count++))
else
log_message "Failed to move $filename"
fi
fi
done < <(find "$source_dir" -maxdepth 1 -type f) # process substitution keeps moved_count in this shell
log_message "Organized $moved_count files"
}
# Find and handle duplicates
handle_duplicates() {
local directory="$1"
local action="${2:-$DUPLICATE_ACTION}"
log_message "Finding duplicates in $directory"
if ! command -v fdupes &> /dev/null; then
log_message "Installing fdupes for duplicate detection"
sudo apt update && sudo apt install -y fdupes
fi
local duplicates_file="/tmp/duplicates_$$"
fdupes -r "$directory" > "$duplicates_file"
local duplicate_sets=0
local files_processed=0
local set_first=""
while IFS= read -r line; do
if [[ -z "$line" ]]; then
# A blank line separates fdupes sets; start tracking a new set
((duplicate_sets++))
set_first=""
elif [[ -f "$line" ]]; then
if [[ -z "$set_first" ]]; then
# Always keep the first file of each set
set_first="$line"
else
case "$action" in
"delete")
rm "$line"
log_message "Deleted duplicate: $line"
;;
"link")
rm "$line"
ln "$set_first" "$line"
log_message "Linked duplicate: $line"
;;
"ask")
echo "Duplicate of $set_first: $line"
# Read the answer from the terminal; stdin is the duplicates file
read -p "Delete this file? (y/N): " -n 1 -r < /dev/tty
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
rm "$line"
log_message "User deleted duplicate: $line"
fi
;;
esac
((files_processed++))
fi
fi
done < "$duplicates_file"
rm "$duplicates_file"
log_message "Processed $duplicate_sets sets of duplicates"
}
# Archive old files
archive_old_files() {
local source_dir="$1"
local days="${2:-$AUTO_ARCHIVE_DAYS}"
log_message "Archiving files older than $days days from $source_dir"
local archive_dir="$source_dir/Archive/$(date +%Y)"
mkdir -p "$archive_dir"
local archived_count=0
find "$source_dir" -type f -mtime +$days ! -path "*/Archive/*" ! -path "*/Backup/*" | while read -r file; do
local relative_path="${file#$source_dir/}"
local archive_path="$archive_dir/$relative_path"
local archive_parent=$(dirname "$archive_path")
mkdir -p "$archive_parent"
if mv "$file" "$archive_path"; then
log_message "Archived: $relative_path"
((archived_count++))
fi
done < <(find "$source_dir" -type f -mtime +"$days" ! -path "*/Archive/*" ! -path "*/Backup/*") # keeps archived_count in this shell
log_message "Archived $archived_count files"
}
# Create backup
create_backup() {
local source_dir="$1"
local backup_dir="${2:-$BACKUP_DIR}"
if [[ "$BACKUP_ENABLED" != "true" ]]; then
log_message "Backup disabled in configuration"
return
fi
log_message "Creating backup of $source_dir to $backup_dir"
local timestamp=$(date +%Y%m%d_%H%M%S)
local backup_target="$backup_dir/backup_$timestamp"
mkdir -p "$backup_target"
if rsync -av --progress "$source_dir/" "$backup_target/"; then
log_message "Backup created successfully: $backup_target"
# Compress backup
if command -v tar &> /dev/null; then
tar -czf "$backup_target.tar.gz" -C "$backup_dir" "backup_$timestamp"
rm -rf "$backup_target"
log_message "Backup compressed: $backup_target.tar.gz"
fi
else
log_message "Backup failed"
fi
}
# Generate organization report
generate_report() {
local directory="$1"
local report_file="/tmp/organization_report_$(date +%Y%m%d).txt"
{
echo "File Organization Report"
echo "========================"
echo "Generated: $(date)"
echo "Directory: $directory"
echo ""
echo "Directory Structure:"
tree -d -L 2 "$directory" 2>/dev/null || find "$directory" -type d | head -20
echo ""
echo "File Count by Type:"
for type in Documents Images Videos Music Archives Code Data Misc; do
if [[ -d "$directory/$type" ]]; then
count=$(find "$directory/$type" -type f | wc -l)
echo "$type: $count files"
fi
done
echo ""
echo "Storage Usage:"
du -sh "$directory"/* 2>/dev/null | sort -hr
echo ""
echo "Recent Activity (from log):"
tail -20 "$LOG_FILE"
} > "$report_file"
echo "Report generated: $report_file"
# Email report if mail is configured
if command -v mail &> /dev/null; then
cat "$report_file" | mail -s "File Organization Report" "$USER@localhost"
fi
}
# Main execution
main() {
case "${1:-help}" in
"organize")
organize_by_type "${2:-$HOME/Downloads}" "${3:-$HOME/Organized}"
;;
"duplicates")
handle_duplicates "${2:-$HOME}" "${3:-ask}"
;;
"archive")
archive_old_files "${2:-$HOME}" "${3:-365}"
;;
"backup")
create_backup "${2:-$HOME}" "${3:-$BACKUP_DIR}"
;;
"report")
generate_report "${2:-$HOME}"
;;
"full")
# Full organization workflow
local target_dir="${2:-$HOME/Organized}"
log_message "Starting full organization workflow"
create_backup "$HOME/Downloads"
organize_by_type "$HOME/Downloads" "$target_dir"
handle_duplicates "$target_dir" "ask"
archive_old_files "$target_dir"
generate_report "$target_dir"
log_message "Full organization workflow completed"
;;
"config")
nano "$CONFIG_FILE"
;;
"help"|*)
echo "Advanced File Organizer for Ubuntu 22.04"
echo "Usage: $0 [command] [options]"
echo ""
echo "Commands:"
echo " organize [source] [target] - Organize files by type"
echo " duplicates [dir] [action] - Handle duplicate files (ask/delete/link)"
echo " archive [dir] [days] - Archive old files"
echo " backup [source] [target] - Create backup"
echo " report [directory] - Generate organization report"
echo " full [target] - Run complete organization workflow"
echo " config - Edit configuration"
echo ""
echo "Examples:"
echo " $0 organize ~/Downloads ~/Organized"
echo " $0 duplicates ~/Documents delete"
echo " $0 archive ~/Documents 180"
echo " $0 full ~/MyFiles"
;;
esac
}
# Run main function with all arguments
main "$@"
Best Practices for Data Organization
Strategic Planning
Data Classification Strategy:
Identify data types and access patterns
Define retention policies
Establish backup requirements
Plan for data growth
Directory Structure Design:
Use consistent naming conventions (see the sketch after this list)
Implement logical hierarchies
Plan for scalability
Document organization rules
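For example, an ISO-8601 date prefix plus a short slug keeps names sortable and searchable; a minimal sketch (the file names are invented):
# Manual rename following the convention
mv "meeting notes.odt" "2024-03-15_acme-kickoff_meeting-notes.odt"
# One-shot batch version: prefix each file with its modification date
for f in *.odt; do
mv "$f" "$(date -r "$f" +%F)_${f// /-}"
done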
Automation and Maintenance
Automated Organization:
# Daily organization cron job
0 2 * * * /home/user/bin/organize_data.sh
# Weekly duplicate cleanup (-N deletes without prompting, keeping the first copy)
0 1 * * 0 fdupes -r -d -N /home/user/Documents
# Monthly archival
0 0 1 * * /home/user/bin/archive_old_files.sh
Monitoring and Alerts:
# Storage usage monitoring
if [ $(df / | tail -1 | awk '{print $5}' | sed 's/%//') -gt 85 ]; then
echo "Disk usage critical" | mail -s "Storage Alert" admin
fi
Security and Compliance
Data Protection:
Implement access controls
Use encryption for sensitive data
Regular security audits
Backup verification
Compliance Requirements:
Document data handling procedures
Implement retention policies
Audit trail maintenance
Regular compliance reviews
# Set up secure data handling
# Encrypt sensitive directories
sudo apt install ecryptfs-utils
sudo ecryptfs-migrate-home -u username # run while the target user is logged out
# Set appropriate permissions
find /home/user/Documents -type f -exec chmod 640 {} \;
find /home/user/Documents -type d -exec chmod 750 {} \;